LWIP TCP socket close - disconnecting fix #10476

tymoteuszblochmobica · 2019-04-25T11:38:38Z

Description

Currently LWIP TCP socket close doesn't care TCP close handshaking is properly completed.
If network interface disconnect follows immediately after TCP socket_close and TCP FSM is in
ESTABLISHED state this causes too early eth/wifi driver stop.
In result FIN ACK sequence is being corrupted and TCP FSM stucks in FIN_WAIT_1 state until LWIP netif is deleted. This is a reason of failing TCP tests due to lack of memory because TCP PCB mempool is not freed on time.

This fix check if socket is TCP and FSM is in ESTABLISHED state.
Then give extra time for connection close handshaking until TIME_WAIT state or timeout occurs.

Pull request type

[x ] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

@SeppoTakalo
@mikaleppanen
@kjbracey-arm

Release Notes

ciarmcom · 2019-04-25T15:01:17Z

@tymoteuszblochmobica, thank you for your changes.
@SeppoTakalo @kjbracey-arm @mikaleppanen @ARMmbed/mbed-os-ipcore @ARMmbed/mbed-os-maintainers please review.

SeppoTakalo · 2019-04-26T08:40:37Z

@mikaleppanen @kjbracey-arm Any opinions about this?

For me, this sounds like a dirty workaround. We are going to do a busy loop, where non-blockiness is expected.

This surely fix the issue, but I wonder if there is any more elegant solution?

kjbracey · 2019-04-26T09:10:58Z

features/lwipstack/LWIPStack.cpp

+        do {
+            wait_ms(5);
+            check_limit--;
+        } while (tcp != NULL && tcp->state < TIME_WAIT && check_limit);


This is potentially racey - the PCB that tcp is pointing at could be deallocated/reused while you're waiting, because you called delete.

It seems that lwIP does have netconn_shutdown - use that prior to the wait, and then keep monitoring s->conn->pcb.tcp. (I'm guessing that can become null due to the shutdown):

Something like:

if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP) { netconn_shutdown(s->conn, true, true); while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) { wait(); } }

Still got a minor/theoretical thread-safety issue on polling s->conn->pcb.tcp though - not sure about how we lock against the tcpip thread. Is there a callback we could get to know when the state changes?

Alternatively, have you considered configuring SO_LINGER in lwIP? Even if we don't expose the option via NSAPI's setsockopt, you could probably get lwIP to give this wait behaviour by setting it based on your config option.

I think you'd have to poke the option flag and linger timeout value directly in, imitating the code in lwip_getsockopt_impl, and then mark the socket blocking before the delete to activate it.

kjbracey · 2019-04-26T10:20:46Z

@SeppoTakalo - feels a bit icky to me too, but probably not that bad in practice. Don't forget that lwip's connect is also blocking when it shouldn't be...

And many off-board stacks probably do stall for a while in their "close" operations.

This is at least controllable via an option, and doesn't preclude adding more direct control later like public SO_LINGER setsockopt API.

0xc0170 · 2019-05-02T12:58:42Z

@tymoteuszblochmobica Please update

tymoteuszblochmobica · 2019-05-02T16:29:58Z

I checked that

if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP) {
netconn_shutdown(s->conn, true, true);
while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) {
wait();
}
}

doesn't work due to :

netconn_shutdown also calls lwip_netconn_do_close_internal which sets

conn->pcb.tcp to NULL

So it doesn't provide useful information about PCB tcp FSM and waiting loop exits immediately.

Unfortunately there is no callback from tcp FSM state change.

Making socket blocking and enabling LINGER option also cannot work without LWIP code modification.
SO_LINGER requires unsent or unacked data to trigge tcp_pooling
if ((conn->linger >= 0) && (conn->pcb.tcp->unsent || conn->pcb.tcp->unacked)) {

I this case there is no unsent and unacked data and tcp_shutdown sends FIN to server and immediately returns.

Maybe my solution proposal is ugly but i wanted to keep LWIP code unchanged.

kjbracey · 2019-05-03T07:42:37Z

Okay, that shutdown implementation is unfortunate. As the comment in "netconn_close" says "shutting down both ends is the same as closing". :( Why???

You could just shutdown write and then poll - that is what actually is important in terms of the TCP protocol and state machine - it sends the FIN. Shutting down read is an assertion that you don't expect any more data (just a FIN) and makes it send RST if they do (to indicate what they sent has been ignored). That's less critical - it doesn't directly affect the state machine, except in that "drop with reset" case where more data arrives.

If data does come in while you're polling, because you only shutdown write, I believe it should still send a RST when you close without having read it. (rst_on_unacked_data in tcp_close_shutdown).

tymoteuszblochmobica · 2019-05-06T15:38:53Z

If only shutdown write is selected netconn_shutdown(s->conn, false, true) works now as we expected.
It doesn't set conn->pcb.tcp to NULL until call lwip_netconn_do_close, and now waiting loop works fine.

kjbracey

Glad that technique works. Should be fine. Bit more refinement.

kjbracey · 2019-05-07T10:30:59Z

features/lwipstack/mbed_lib.json

@@ -100,6 +100,10 @@
            "help": "Maximum number of retransmissions of SYN segments. Current default (used if null here) is set to 6 in opt.h",
            "value": null
        },
+        "tcp-close-timeout": {
+            "help": "Maximum number of 5ms intervals for TCP close handshaking timeout",


Make the config be milliseconds - exposing 5ms interval multiple is a bit weird. Even if you have 5ms granularity, you can subtract 5 from check_limit the loop (but make it signed).

Or I'd be quite happy with 1ms polling anyway; I think being more responsive to close - shutting down 4ms faster - is probably more of a power saving than only checking every 5 milliseconds.

kjbracey · 2019-05-07T10:31:58Z

features/lwipstack/LWIPStack.cpp

+    if (NETCONNTYPE_GROUP(s->conn->type) == NETCONN_TCP && s->conn->pcb.tcp->state == ESTABLISHED) {
+        netconn_shutdown(s->conn, false, true);
+        while (s->conn->pcb.tcp && s->conn->pcb.tcp->state < TIME_WAIT && check_limit) {
+            wait_ms(5);


ThisThread::sleep_for(5) is better - wait_ms is on the way to deprecation.

Although, I wonder if you now get an event from lwIP when the state changes, so you can do an EventFlags::wait(TCP_CLOSE_TIMEOUT) rather than a loop. Probably no biggie.

tymoteuszblochmobica · 2019-05-09T16:09:37Z

Replaced wait_ms ThisThread::sleep_for(1 ms)

Tried to use callback for EventFlags eg. my_event_member::set(TCP_FLAG) and use my_event_member.wait_any(TCP_FLAG,timeout) instead of waiting loop.

Calling sequence:
tcp_input
TCP_EVENT_CLOSED( )
pcb->recv assigned recv_tcp

recv_tcp calls
sys_mbox_trypost(&conn->recvmbox ) -not used yet it requires more code modify
or
API_EVENT(conn->NETCONN_EVT_RCV_PLUS
netconn->callback assigned LWIP::socket_callback for setting flag my_event_member::set(TCP_FLAG)

It exits my_event_member.wait_any(TCP_FLAG only on timeout and LWIP::socket_callback is blocked (but its called from tcpip_thread context ).
Do i miss something?

kjbracey · 2019-05-10T12:51:27Z

Not quite following your comment. Sounds like you attempted what I suggested - setting an event flag in socket_callback. My hope was that the callback would be called on the state transition you cared about so you could do something like if (nc is TCP and finished) EventFlags::set(FLAG_TCP_FINISHED) in socket_callback.

That should work, but only if lwIP is actually sending an event at the right time. Maybe it isn't. If not, not worth messing with it.

If TCP FSM is in ESTABLISHED state, waits for TCP close handshaking until TIME_WAIT The purpose is to prevent eth/wifi driver stop and FIN ACK corrupt. This may happend if network interface disconnect follows immediately after socket_close.

tymoteuszblochmobica · 2019-05-10T15:24:07Z

After code cleanup and applying EventFlag again it started working as expected.
Probably code revert removed spurious changes used for tests.

tymoteuszblochmobica · 2019-05-15T13:19:07Z

Kevin could You check is it OK?

0xc0170 · 2019-05-20T09:35:07Z

CI started

mbed-ci · 2019-05-20T12:34:56Z

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 1
Build artifacts

ciarmcom requested review from kjbracey, mikaleppanen, SeppoTakalo and a team April 25, 2019 15:01

ciarmcom added the needs: review label Apr 25, 2019

kjbracey suggested changes Apr 26, 2019

View reviewed changes

0xc0170 added needs: work and removed needs: review labels Apr 29, 2019

tymoteuszblochmobica force-pushed the closetcp branch from c7c5bc3 to 9bb4526 Compare May 6, 2019 15:33

kjbracey suggested changes May 7, 2019

View reviewed changes

tymoteuszblochmobica force-pushed the closetcp branch from 9bb4526 to c1230bd Compare May 9, 2019 15:44

tymoteuszblochmobica force-pushed the closetcp branch 2 times, most recently from 1806702 to 7fb87bd Compare May 10, 2019 14:55

Fixed TCP connection close.

51610cc

If TCP FSM is in ESTABLISHED state, waits for TCP close handshaking until TIME_WAIT The purpose is to prevent eth/wifi driver stop and FIN ACK corrupt. This may happend if network interface disconnect follows immediately after socket_close.

tymoteuszblochmobica force-pushed the closetcp branch from 7fb87bd to 51610cc Compare May 10, 2019 15:01

kjbracey approved these changes May 15, 2019

View reviewed changes

adbridge added needs: CI and removed needs: work labels May 17, 2019

SeppoTakalo approved these changes May 21, 2019

View reviewed changes

0xc0170 added ready for merge release-version: 5.13.0-rc1 and removed needs: CI labels May 21, 2019

0xc0170 merged commit 02eaad4 into ARMmbed:master May 21, 2019

0xc0170 removed the ready for merge label May 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LWIP TCP socket close - disconnecting fix #10476

LWIP TCP socket close - disconnecting fix #10476

tymoteuszblochmobica commented Apr 25, 2019

ciarmcom commented Apr 25, 2019

SeppoTakalo commented Apr 26, 2019

kjbracey Apr 26, 2019

kjbracey commented Apr 26, 2019

0xc0170 commented May 2, 2019

tymoteuszblochmobica commented May 2, 2019 •

edited

kjbracey commented May 3, 2019

tymoteuszblochmobica commented May 6, 2019 •

edited

kjbracey left a comment

kjbracey May 7, 2019

kjbracey May 7, 2019

tymoteuszblochmobica commented May 9, 2019

kjbracey commented May 10, 2019

tymoteuszblochmobica commented May 10, 2019

tymoteuszblochmobica commented May 15, 2019 •

edited

0xc0170 commented May 20, 2019

mbed-ci commented May 20, 2019

LWIP TCP socket close - disconnecting fix #10476

LWIP TCP socket close - disconnecting fix #10476

Conversation

tymoteuszblochmobica commented Apr 25, 2019

Description

Pull request type

Reviewers

Release Notes

ciarmcom commented Apr 25, 2019

SeppoTakalo commented Apr 26, 2019

kjbracey Apr 26, 2019

Choose a reason for hiding this comment

kjbracey commented Apr 26, 2019

0xc0170 commented May 2, 2019

tymoteuszblochmobica commented May 2, 2019 • edited

kjbracey commented May 3, 2019

tymoteuszblochmobica commented May 6, 2019 • edited

kjbracey left a comment

Choose a reason for hiding this comment

kjbracey May 7, 2019

Choose a reason for hiding this comment

kjbracey May 7, 2019

Choose a reason for hiding this comment

tymoteuszblochmobica commented May 9, 2019

kjbracey commented May 10, 2019

tymoteuszblochmobica commented May 10, 2019

tymoteuszblochmobica commented May 15, 2019 • edited

0xc0170 commented May 20, 2019

mbed-ci commented May 20, 2019

Test run: SUCCESS

tymoteuszblochmobica commented May 2, 2019 •

edited

tymoteuszblochmobica commented May 6, 2019 •

edited

tymoteuszblochmobica commented May 15, 2019 •

edited